Hierarchical tiling for improved superscalar performance

Authors

  • Larry Carter
  • Jeanne Ferrante
  • Susan Flynn Hummel
Abstract

It takes more than a good algorithm to achieve high performance: inner-loop performance and data locality are also important. Tiling is a well-known method for parallelization and for improving data locality. However, tiling has the potential of being even more beneficial. At the finest granularity, it can be used to guide register allocation and instruction scheduling; at the coarsest level, it can help manage magnetic storage media. It also can be useful in overlapping data movement with computation, for instance by prefetching data from archival storage, disks and main memory into cache and registers, or by choreographing data movement between processors. Hierarchical tiling is a framework for applying both known tiling methods and new techniques to an expanded set of uses. It eases the burden on several compiler phases that are traditionally treated separately, such as scalar replacement, register allocation, generation of message passing calls, and storage mapping. By explicitly naming and copying data, it takes control of the mapping of data to memory and of the movement of data between processing elements and up and down the memory hierarchy. This paper focuses on using hierarchical tiling to exploit superscalar pipelined processors. On a simple example, it improves performance by a factor of 3, achieving perfect use of the superscalar processor's pipeline. Hierarchical tiling is presented here as a method of hand-tuning performance; while outside the scope of this paper, the ideas can be incorporated into an automatic preprocessor or optimizing compiler.
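The idea of tiling at more than one level can be illustrated with a small sketch. The code below is not taken from the paper; it is a minimal illustration, assuming a dense matrix product as the "simple example", an outer tile sized for cache (`cache_tile`, a hypothetical parameter), and an inner 2x2 tile whose four accumulators are held in named local scalars, in the spirit of the scalar replacement the abstract mentions. Python is used only to show the loop structure; any real speedup would require a compiled language and a tile shape matched to the target processor.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a * b for list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_two_level(a, b, cache_tile=4):
    """Two-level tiled product (illustrative sketch, not the paper's code).

    Outer loops (ii, jj, kk) walk cache-sized tiles; inner loops walk
    2x2 "register" tiles. Assumes the row count of a, the column count
    of b, and cache_tile are all even, so every 2x2 tile is complete.
    """
    n, m, p = len(a), len(b), len(b[0])
    assert n % 2 == 0 and p % 2 == 0 and cache_tile % 2 == 0
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, cache_tile):
        for jj in range(0, p, cache_tile):
            for kk in range(0, m, cache_tile):
                k_end = min(kk + cache_tile, m)
                for i in range(ii, min(ii + cache_tile, n), 2):
                    for j in range(jj, min(jj + cache_tile, p), 2):
                        # Explicitly name and copy the 2x2 block of c
                        # into scalars (stand-ins for registers).
                        c00, c01 = c[i][j], c[i][j + 1]
                        c10, c11 = c[i + 1][j], c[i + 1][j + 1]
                        for k in range(kk, k_end):
                            a0, a1 = a[i][k], a[i + 1][k]
                            b0, b1 = b[k][j], b[k][j + 1]
                            # Four independent multiply-adds per step:
                            # none depends on another's result.
                            c00 += a0 * b0
                            c01 += a0 * b1
                            c10 += a1 * b0
                            c11 += a1 * b1
                        # Copy the block back once per k-tile.
                        c[i][j], c[i][j + 1] = c00, c01
                        c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c
```

In this sketch each loaded element of `a` and `b` is used twice within the 2x2 tile, and the four accumulations are mutually independent, which is the kind of structure a pipelined processor can overlap. The tile sizes here are placeholders rather than tuned values.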


Similar resources

Portable Compilation of Vector Expressions for Architectures with Memory Hierarchy

The paper presents a scheme of code generation for vector expressions implemented in the C[] compiler (C[] is a vector ANSI C superset aimed at vector and superscalar architectures). The scheme is based on two well-known optimization techniques: loop-invariant code motion and iteration-space tiling. The problem of finding the optimal tile size for the imperfectly nested loop system implementing ...


Analysis of the Task Superscalar Architecture Hardware Design

In this paper, we analyze the operational flow of two hardware implementations of the Task Superscalar architecture. The Task Superscalar is an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks in an out-of-order manner. In this paper, we present a base implementation of the Task Superscalar a...


Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure

We improve the performance of sparse matrix-vector multiply (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this kind of structure. Our technique splits the matrix, A, into a sum, A1 + A2 + . . . + As, where each term is stored in a new data str...


Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors

We present methods for developing high-performance computational kernels and dense linear algebra routines. The microarchitecture of AMD processors is analyzed with the goal of achieving peak computational rates. Approaches for implementing matrix multiplication algorithms are suggested for hierarchical memory computers. Block versions of matrix multiplication and LU decomposition algorithms are c...


An application specific multi-port RAM cell circuit for register renaming units in high speed microprocessors

We present a novel custom circuit for a superscalar microprocessor renaming unit and compare its performance with a conventional design, referring to an industrial 0.35 μm CMOS process. Speed and power consumption are significantly improved.



Publication date: 1995